Method for Alignment of DNA Sequences with 
Enhanced Accuracy and Read Length 

Background of the Invention 

This application relates to DNA sequencing technology and in particular to a
method for alignment of DNA sequences which provides enhanced accuracy and read-length.

DNA sequencing is generally performed today using one of two methodologies:
the chemical degradation method or the chain termination method. Of these, the chain
termination method originally described by Sanger et al., Proc. Natl. Acad. Sci. USA 74: 5463-5467
(1977), or variations thereof, has been adopted in many cases for development of
automated sequencing instruments and protocols.

In the chain termination sequencing method, fragments are generated using chain
termination reagents in a template-dependent polymerization reaction. The lengths of the
fragments indicate the positions of one species of base in a target polynucleotide. If fragment
sets are generated for each of the four species of bases (A, C, G and T), analysis of the fragment
sizes permits the explicit determination of the sequence of the target polynucleotide. While the
translation of this conceptual methodology into practice is effective for determination of
sequences, the application in automated systems has faced numerous challenges. These include
the fact that the band shape produced following electrophoresis of real fragments is not
consistent from one band to the next and may not be perfectly straight (smiling may occur);
variations which can occur in peak spacing from one lane of a gel to the next; variations in peak
spacing which can occur as the length of the run increases; and decreases in resolution which
occur as the length of the run increases. Furthermore, since much of the cost associated with
DNA sequencing is in the set-up time involved, for clinical and diagnostic applications the larger
the length of DNA which can be sequenced with accuracy, the smaller the per patient cost can be.
These considerations have led to a variety of proposals for improving the chemistry used in
sequencing, or for improving the manner in which data representing the detected sequencing
fragment is processed. The present invention relates to the second type of improvement.

In order to obtain meaningful sequence information from raw data obtained by
electrophoresis of labeled sequencing fragments, one of the most important factors is the
alignment of the data traces representing each species of base. In non-automated systems, this is
frequently done by eye, and the eye of a skilled technician is in fact a remarkable tool for this
purpose. Commonly assigned US Patent No. 5,916,747, which is incorporated herein by
reference, discloses a method for aligning data traces from four channels of an automated
electrophoresis detection apparatus in which each channel detects the products of one of four
chain-termination DNA sequencing reactions such that the four channels together provide
information concerning the sequence of all four bases within a nucleic acid polymer being
analyzed. The method places the four data traces in a trial alignment, and then determines
coefficients of shift and stretch for selected data points within each normalized data trace to
optimize a cost function which reflects the extent of overlap of peaks in the combined
normalized data traces to which the coefficients have been applied. Warp functions are then
generated for the normalized data traces from the coefficients of shift and stretch determined for
the selected data points, and applied to the respective data trace to produce four warped data
traces which are assembled to form an aligned data set. This data set is then used for base-calling
to complete the sequence determination process.

The procedure of the '747 patent is generally suited for the determination of
sequences where explicit data for the positions of all four bases are obtained. On the other hand,
it is not always necessary to determine the positions of all four species of bases in order to
obtain diagnostic information from a given polynucleotide. (See commonly assigned US Patent
No. 5,834,189, which is incorporated herein by reference.) Commonly assigned US Patent No.
5,853,979, which is incorporated herein by reference, discloses a method for the interpretation of
experimental fragment patterns for polynucleotides having putatively known sequences. In this
method, at least one raw fragment pattern representing the positions of a selected nucleotide base
as a function of migration time or distance is obtained for the experimental sample. The
fragment pattern is evaluated to determine one or more "normalization coefficients." These
normalization coefficients reflect the displacement, stretching or shrinking, and rate of stretching
or shrinking of the clean fragment, or segments thereof, which are necessary to obtain a suitably
high degree of correlation between the clean fragment pattern and a standard fragment pattern
which represents the positions of the selected nucleic acid base within a standard polymer
actually having the known sequence as a function of migration time or distance. The
normalization coefficients are then applied to the fragment pattern to produce a normalized fragment
pattern which is used for base-calling in a conventional manner. As indicated, however, this
technique requires prior knowledge of the expected fragment pattern for the polynucleotide being
analyzed.

Notwithstanding such techniques, there remains room for improvement in the
manner in which automated analysis of sequencing fragment patterns is carried out. In
particular, there remains a need for systems which allow enhanced read-length, i.e., the analysis
of a greater number of bases in a single lane of a gel, without loss of accuracy or substantial
increase in analysis time. It is an object of the present invention to provide a method which
answers this need.



Summary of the Invention

The present invention provides a method for aligning sequence data traces. In
accordance with the invention, an experimental data trace representing the positions of a first
species of base within a target polynucleotide and a reference data trace representing the
positions of a second species of base (which may be the same as or different from the first
species) within a reference polynucleotide are obtained by separating appropriate sequencing
fragments generated from the target and reference polynucleotides in a common lane of an
electrophoresis gel. For each reference data trace, a plurality of peaks corresponding to
fragments having a size in the range of 40 to 1200 bases are selected. A base number is assigned
to each of the selected peaks in the reference data trace, and a numerical "peak file" is created
with information about the peak number and migration time (or distance). This peak file is
analyzed to determine a set of polynomial coefficients which will allow substantial linearization
of a plot of peak number versus separation between adjacent peaks and alignment of the traces
with respect to each other. These coefficients are used to create a corrected time scale identifying
where peaks should be located on a given experimental gel. This corrected time scale is used to
guide the sampling of the experimental data, and for assignment of peaks within the data.

Brief Description of the Drawings

Fig. 1 shows a plot of peak spacing versus peak number for unaligned data, and
data aligned with third and fifth order polynomials;
Fig. 2 shows a plot of peak spacing versus peak number for data aligned with
third, fourth and fifth order polynomials;
Figs. 3A and B show plots of the difference, for each lane, between the run time
of a base (322nd nt) and its average value for all 16 lanes of a gel. Fig. 3A corresponds to the
run time difference in the raw data; Fig. 3B is the run time difference after alignment;
Fig. 4 shows the relationship between accuracy and read length for a first set of
experimental data which was well-aligned on the gel;
Fig. 5 shows the relationship between accuracy and read length for a second set of
experimental data which was poorly-aligned on the gel; and
Fig. 6 shows a system in accordance with the invention.

Detailed Description of the Invention 

The present invention provides a method for linearization and alignment of sequence
data traces. As used herein, the term "linearize" refers to establishing equal spacing in a time
domain between adjacent peaks within the overall sequence in an experimental data trace. The
term "align" refers to establishing the correct positions within the overall sequence for the peaks
in an experimental data trace. When a data trace is obtained for each of the four bases, the
alignment process results in an explicit determination of the position of each and every base.
However, since in some instances it is not necessary to perform all four sequencing reactions and
analyze the results to obtain useful diagnostic data, "alignment" can be performed on a single
trace, representing the positions of a single species of nucleotide base within a target
polynucleotide. In this case, the single trace after linearization is "aligned" with a standard time
scale, to show the base numbers associated with peaks within the linearized trace. Alignment can
also be performed on data sets of two or more traces representing the positions of two or more
species of nucleotide bases within the target polynucleotide.

The process of linearization and alignment is essentially one of assigning a correct
numerical position to each of the bases. An important aspect of the linearization and alignment
process is compensation for variation in peak spacing which occurs over time even within a
single lane of an electrophoresis gel. The present invention performs this compensation by
co-electrophoresing a reference sequence with the experimental sequence and utilizing the resulting
reference data trace to define the correct peak spacing.

The specification and claims of this application use the term "DNA sequencing
fragments" to describe the mixture of polynucleotides which results when chain extension
polymerization is performed in the presence of a chain-terminating base analog, such as a
dideoxynucleotide triphosphate. The term "DNA sequencing fragments" only requires the
presence in the mixtures of fragments the lengths of which are indicative of the positions of one
type of base within the polynucleotide being analyzed.

In the simplest embodiment of the invention, experimental and reference data
traces obtained from a single lane of an electrophoresis gel are evaluated. The experimental
sample may be, for example, the A-sequencing fragments generated from a target
polynucleotide of interest. The reference sample is, for example, the T-sequencing fragments
generated from a reference polynucleotide of known sequence. Preferably, the reference
polynucleotide is of similar total length to the experimental polynucleotide so that the reference
data extends over the entire length of the experimental sequence information.

Because the reference polynucleotide has a known sequence, it is possible to
immediately create a peak table having two columns: actual retention time and peak number.
Thus, for example, if the sequence were 9 bases long, and had the sequence ACATTACGA, then
the data trace derived from the A-sequencing fragment would have four peaks appearing at times
T_1, T_2, T_3 and T_4, respectively. The peak table would therefore appear as follows:

        Retention time    Peak number
        T_1               1
        T_2               3
        T_3               6
        T_4               9

If the spacing of the peaks in the gel over this region were exactly the same, then a plot of T
versus peak number would produce a straight line, and a plot of the spacing (the difference
between each adjacent peak) versus peak number would produce a straight, horizontal line.
Because experimental data does not meet this ideal, however, the result is in fact far different.
Thus, as shown in Fig. 1, the experimental spacing between adjacent peaks as a function of base
number may follow a complex curve, at first increasing through a maximum, and then decreasing
again.
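
By way of non-limiting illustration, the peak table for the ACATTACGA example can be sketched in Python as follows. The function names and the retention times (10, 30, 60 and 90 time units) are hypothetical and chosen only for the illustration; dividing each time difference by the number of bases separating the two peaks is likewise an assumption about how "spacing" is normalized.

```python
def build_peak_table(reference_sequence, base, peak_times):
    """Pair each observed peak time with the base number it represents.

    reference_sequence: known sequence of the reference polynucleotide.
    base: the terminated base species (e.g. "A") whose fragments were run.
    peak_times: measured retention times of the peaks, in order of elution.
    """
    base_numbers = [
        position
        for position, letter in enumerate(reference_sequence.upper(), start=1)
        if letter == base.upper()
    ]
    if len(base_numbers) != len(peak_times):
        raise ValueError("number of peaks does not match occurrences of the base")
    return list(zip(peak_times, base_numbers))


def per_base_spacing(peak_table):
    """Time per base between adjacent peaks, keyed by the later peak's base number."""
    spacing = []
    for (t_prev, n_prev), (t_next, n_next) in zip(peak_table, peak_table[1:]):
        spacing.append((n_next, (t_next - t_prev) / (n_next - n_prev)))
    return spacing


# Hypothetical retention times for the four A peaks of ACATTACGA.
table = build_peak_table("ACATTACGA", "A", [10.0, 30.0, 60.0, 90.0])
print(table)                    # [(10.0, 1), (30.0, 3), (60.0, 6), (90.0, 9)]
print(per_base_spacing(table))  # [(3, 10.0), (6, 10.0), (9, 10.0)]
```

In this idealized input the per-base spacing is constant, which corresponds to the straight, horizontal line described above; real data would show a varying spacing.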

In accordance with the present invention, a curve fitting procedure is applied to
the raw reference data trace in which the data is fit to a polynomial, generally a third- or
higher-order polynomial. Although this fitting process is generally performed in actual practice using a
computer program and any of various known curve fitting programs, the procedure employed can
be understood from the discussion below. In the unaligned data, one is essentially plotting the
function

$$\Delta T = mP + c$$

where $\Delta T$ is the spacing between adjacent peaks (in units of time), $m$ is the slope of the line, $c$ is
a constant which is characteristic of the gel and which reflects the characteristic peak spacing,
and $P$ is peak number. In ideal data, the slope $m$ is 0, such that there is no actual relationship
between $\Delta T$ and $P$, and $\Delta T$ is simply a constant. As one can see in Fig. 1, the experimental data
are far from being a straight line. In this case, the experimental curve can be approximated by a
polynomial. The empirical curve $\Delta T$ is fit to a polynomial function by a least squares method:

$$\Delta T = a_{i0} + a_{i1} P + a_{i2} P^2 + \dots + a_{ik} P^k .$$

The degree ($k$) of the polynomial is an input parameter of the fitting program. The procedure
generates a set of coefficients $\{a_{ik}\}$ for each gel lane ($i$). A curve fitting program identifies the
coefficients $a_{ik}$ (which may be positive or negative) and the constant $a_{i0}$ which bring the resulting
plot of the reference data closest to a straight line. Based on the set of polynomial coefficients
$\{a_{ik}\}$, a corrected time scale is defined for each peak in gel lane #i, according to the formula

$$T_{ip} = C_i \left[ a_{i0} + a_{i1} t_{ip} P + a_{i2} t_{ip} P^2 + \dots + a_{ik} t_{ip} P^k \right],$$

where $T_{ip}$ is the corrected time value for the reference peak of length $p$, $t_{ip}$ is the experimentally
measured run time of this peak, and $C_i$ is a scaling factor. This transformation causes the spacing
between consecutive peaks in the corrected time domain ($dT_{ip}/dp$) to remain constant over the
course of the run. The transformation (linearization) is performed for both the reference peaks
and the sample peaks in each gel lane ($i$).
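
The least squares fit of peak spacing to a polynomial in peak number can be illustrated with the following minimal Python sketch. It is not the fitting program used in practice; it solves the normal equations by Gaussian elimination using only the standard library, and the function names `fit_polynomial` and `evaluate_polynomial` are hypothetical. It covers only the spacing fit, not the corrected time scale formula.

```python
def fit_polynomial(peak_numbers, spacings, degree):
    """Least-squares polynomial fit; returns [a0, a1, ..., ak] (lowest order first).

    Solves the normal equations (V^T V) a = V^T y, where V is the
    Vandermonde matrix of the peak numbers. For large peak numbers and
    high degrees it is advisable to rescale the peak numbers first
    (for example, divide by the largest one) to keep the system well conditioned.
    """
    n = degree + 1
    if len(peak_numbers) < n:
        raise ValueError("a polynomial of degree k needs at least k + 1 peaks")
    ata = [[sum(x ** (r + c) for x in peak_numbers) for c in range(n)] for r in range(n)]
    aty = [sum(y * x ** r for x, y in zip(peak_numbers, spacings)) for r in range(n)]

    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for r in range(col + 1, n):
            factor = ata[r][col] / ata[col][col]
            for c in range(col, n):
                ata[r][c] -= factor * ata[col][c]
            aty[r] -= factor * aty[col]

    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(ata[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aty[r] - tail) / ata[r][r]
    return coeffs


def evaluate_polynomial(coeffs, peak_number):
    """Evaluate a0 + a1*P + a2*P^2 + ... at the given peak number."""
    return sum(a * peak_number ** j for j, a in enumerate(coeffs))


# Synthetic spacing data generated from 2 + 3P + 0.5P^2 (illustrative only).
numbers = [0, 1, 2, 3, 4]
spacing = [2.0, 5.5, 10.0, 15.5, 22.0]
coeffs = fit_polynomial(numbers, spacing, degree=2)
print([round(a, 6) for a in coeffs])  # [2.0, 3.0, 0.5]
```

The degree is passed as a parameter, mirroring the statement above that the degree of the polynomial is an input to the fitting program.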

Each gel lane has a different scaling factor, $C_i$. For any particular gel, the set of
values $\{C_i\}$ is chosen to equalize the spacing between consecutive peaks in the corrected time
domain, ($dT_{ip}/dp$), across all lanes of the gel. A gel lane is uniformly compressed by setting
$C_i < 1$, and it is uniformly stretched by setting $C_i > 1$. A set of coefficients $\{C_i\}$ is therefore defined,
such that all lanes of the gel have the same total run time in the corrected time domain. In the
dimension of real time, the data points are evenly spaced. However, in the dimension of
"corrected time", the corresponding time intervals are not of equal lengths. Therefore the
experimental data sets are "resampled" into equally-spaced values (in the corrected time domain)
by quadratic interpolation. Resampling of the data set for each gel lane is done separately,
because the corrected time scale may be different for each lane. The procedure described above
is a global alignment, which precedes any subsequent local alignment by the base calling
software. This global alignment procedure is general, and should be compatible with all types of
local alignment algorithms.
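
The resampling step can be illustrated with the following Python sketch, which assumes that a corrected time value is already available for every data point of a lane and that these values are strictly increasing. It uses three-point (Lagrange) quadratic interpolation, which is one possible reading of "quadratic interpolation"; the function name `resample_uniform` is hypothetical.

```python
def resample_uniform(corrected_times, values, n_out):
    """Resample a trace onto n_out equally spaced points in corrected time.

    corrected_times: strictly increasing corrected time of each data point.
    values: signal intensity at each data point.
    Each output value comes from a quadratic through three neighboring points.
    """
    n = len(corrected_times)
    if n != len(values) or n < 3 or n_out < 2:
        raise ValueError("need matching inputs with at least 3 points and n_out >= 2")
    start, stop = corrected_times[0], corrected_times[-1]
    step = (stop - start) / (n_out - 1)

    resampled = []
    j = 0  # index of the interval [corrected_times[j], corrected_times[j + 1]]
    for k in range(n_out):
        g = start + k * step
        while j < n - 2 and corrected_times[j + 1] < g:
            j += 1
        i = min(j, n - 3)  # first index of the three-point window
        x0, x1, x2 = corrected_times[i:i + 3]
        y0, y1, y2 = values[i:i + 3]
        resampled.append(
            y0 * (g - x1) * (g - x2) / ((x0 - x1) * (x0 - x2))
            + y1 * (g - x0) * (g - x2) / ((x1 - x0) * (x1 - x2))
            + y2 * (g - x0) * (g - x1) / ((x2 - x0) * (x2 - x1))
        )
    return resampled


# Unevenly spaced corrected times with a signal equal to T squared (synthetic).
times = [0.0, 1.0, 3.0, 6.0, 10.0]
signal = [t * t for t in times]
print(resample_uniform(times, signal, 6))  # values at T = 0, 2, 4, 6, 8, 10
```

Because each lane may have its own corrected time scale, a function of this kind would be called once per lane, consistent with the separate resampling described above.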

The basic methodology described above for alignment of a single data trace can
also be applied in other embodiments. For example, data can be obtained for all four bases (A, C,
G and T) in four lanes, to obtain explicit position information for the complete sequence of a
target polynucleotide. In this case, a set of reference sequencing fragments is desirably run in
each of the four lanes. Further, in multi-lane gels, it is desirable to run a set of reference
sequencing fragments in each lane, regardless of the nature of the experimental samples. If a
sequencing apparatus is used that is capable of distinguishing between more than two labels,
multiple experimental sets of sequencing fragments may be run in one lane along with a set of
reference sequencing fragments. In each case where more than one reference data trace is
obtained from a gel, the spacings of all the reference data traces can be combined to produce a
single set of coefficients and single characteristic spacing which is applied to all of the
experimental data traces from the gel.

Several features are common to all of the various embodiments discussed above.
The experimental sequencing fragments and the reference sequencing fragments are each
labeled with a distinguishable label, i.e., the labels on the experimental fragments and reference
fragments are different from one another when they are present in the same lane of the gel. The
nature of the labels is a matter of choice and compatibility with the detection system employed.
Suitable labels include radiolabels, chromophores, chromogenic labels and fluorogenic labels.
Preferred labels, however, are fluorescent labels compatible with automated multi-dye
sequencers. Specific examples of suitable fluorescent labels include cyanine dyes such as Cy5.0
and Cy5.5 (see US Patents Nos. 4,981,977 and 5,268,486), energy transfer dyes (US Patent
No. 5,800,996) and rhodamine dyes (US Patents Nos. 5,366,860 and 4,855,225).

There is no required relationship between the target polynucleotide and the
reference polynucleotide, and it is not mandatory that the same set of reference sequencing
fragments be used in all of the lanes of a gel. This is the case because the alignment depends on
the measured position of the known bases of the reference trace, but not on the identity of the
bases. However, the reference polynucleotide should be selected to provide enough peaks (or
bands) to facilitate the use of a desirable degree of polynomial for fitting the experimental data.
For example, if one wishes to use a 5th-degree polynomial, the reference polynucleotide must
provide at least 6 peaks.

Furthermore, while it is necessary to know the sequence of the reference
polynucleotide for the creation of the initial peak table, it is not necessary to have any a priori
knowledge of the sequence of the target polynucleotide. Thus, while the present invention is
particularly applicable to diagnostic applications where the putative sequence of the target
polynucleotide is known, it is not limited to such applications.

A further factor which can be adjusted by the user is the number of peaks within
the reference data trace that are used in determining the polynomial coefficients and
characteristic spacing. While all of the peaks can be considered, this increases the processing
time and burden. As a practical matter, a much smaller number of peaks can be utilized and still
provide good alignment of the experimental data traces. For example, for alignment of
sequencing fragments spanning 40 to 1,200 bases, from 3 to 40 peaks in the reference data trace
are suitably selected. The selected peaks are preferably distributed fairly evenly throughout the
reference data trace, although precisely equal distribution is not required.
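
One simple way to pick a reduced, roughly evenly distributed set of reference peaks is sketched below in Python. The selection is by position in the peak list, which is an assumption made for this illustration (selection by base number would be equally consistent with the description), and the function names are hypothetical. The helper `min_peaks_for_degree` reflects the fact that a polynomial of degree k has k + 1 coefficients.

```python
def min_peaks_for_degree(degree):
    """A polynomial of degree k has k + 1 coefficients, so needs k + 1 peaks."""
    return degree + 1


def select_reference_peaks(peak_table, n_select):
    """Pick n_select entries spread evenly (by index) through the peak table.

    The first and last peaks are always included when n_select >= 2.
    """
    if not 1 <= n_select <= len(peak_table):
        raise ValueError("n_select must be between 1 and the number of peaks")
    if n_select == 1:
        return [peak_table[0]]
    last = len(peak_table) - 1
    indices = [(k * last) // (n_select - 1) for k in range(n_select)]
    return [peak_table[i] for i in indices]


# Hypothetical peak table: (retention time, base number) for ten reference peaks.
peaks = [(100.0 + 12.5 * n, 40 + 10 * n) for n in range(10)]
print(min_peaks_for_degree(5))                              # 6
print([base for _, base in select_reference_peaks(peaks, 4)])  # [40, 70, 100, 130]
```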

Fig. 6 shows a schematic representation of an apparatus in accordance with the
present invention for evaluating the sequence of a target polynucleotide. The apparatus as shown
comprises a processor housing 10 which has an input 11 for receiving information about one or
more experimental DNA sequencing data traces derived from the separation of experimental
DNA sequencing fragments reflecting the position of at least one base in the target
polynucleotide and one or more reference DNA sequencing data traces derived from the
separation of reference DNA sequencing fragments reflecting the position of at least one base in
a reference polynucleotide of known sequence. For example, input 11 may be in the form of a
wire for transmitting sequence-related data from a sequencer. Data could also be transmitted via
a wireless link, or communicated to the apparatus through disk drive 13.

Within the housing 10 is a data processing apparatus 14 which includes one or
several processors. The processor or processors are operatively programmed

(a) to evaluate the reference DNA sequencing data traces to determine a 
corrected time scale indicative of migration times at which peaks should occur; 

(b) to sample the experimental DNA sequencing data traces at time points 
determined by the corrected time scale; and 

(c) to assign a base number to each peak found in the experimental DNA 
sequencing data traces based upon the corrected time scale, thereby obtaining information about 
the sequence of the target polynucleotide. The assigned base numbers may be further processed 
to provide an output indicative of information about the sequence of the target polynucleotide 
and this information is communicated to the user via an output device. Exemplary output 
devices are a display 15 or printer 16. The information may also be communicated by saving it 
to the disk drive 13 (which can function as either an input or an output device) or through a 
telecommunication connection (such as a modem or internet connection). 

In an embodiment of the invention, the processor programmed to evaluate the 
reference DNA sequence data traces is programmed to perform the steps of: 

(i) identifying a plurality of peaks in the reference DNA sequencing data 
traces, and creating a data table containing the number of each peak based on the known 
sequence of the polynucleotide, and the position of each peak in the reference DNA sequencing 
data trace; 

(ii) identifying a set of coefficients for a polynomial effective to substantially
linearize a plot of peak number versus separation between adjacent peaks; and
(iii) creating from the coefficients and the polynomial a corrected time scale
which reflects the positions at which a peak should occur at any given point in a sequencing data
trace.
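
The assignment of base numbers from a corrected time scale can be illustrated with the following Python sketch. It assumes, for illustration only, that after linearization the corrected time of a peak is a linear function of its base number, described by an origin (the corrected time of a known first base) and a constant spacing per base; the function name and the numerical values are hypothetical.

```python
def assign_base_numbers(corrected_peak_times, origin, spacing, first_base=1):
    """Assign a base number to each peak found in a linearized trace.

    origin: corrected time at which base number `first_base` is expected.
    spacing: constant corrected-time interval between consecutive bases.
    Each peak is assigned to the nearest expected position.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    return [
        first_base + round((t - origin) / spacing)
        for t in corrected_peak_times
    ]


# Peaks found slightly off the expected positions 100, 104 and 110.
print(assign_base_numbers([100.1, 103.9, 110.2], origin=100.0, spacing=2.0))
# [1, 3, 6]
```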

The invention will now be further described and illustrated with reference to the 
following, non-limiting examples. 

Example 1 

Lanes 1, 5, 9 and 13 of a standard 16 lane MICROCEL™ electrophoresis gel
(Visible Genetics Inc.) were loaded with a mixture of the A-terminated sequencing fragments
from M13 labeled with CY5.0 fluorescent cyanine dye label as the experimental sample, and
T-terminated sequencing fragments from M13 labeled with CY5.5 fluorescent cyanine dye label as
the reference sequence. Lanes 2, 6, 10 and 14 were loaded with a mixture of C-terminated
sequencing fragments from M13 labeled with CY5.0 fluorescent cyanine dye label as the
experimental sample and CY5.5-labeled M13 T's as the reference sequence fragments. Lanes 3,
7, 11 and 15 were loaded with a mixture of G-terminated sequencing fragments from M13
labeled with CY5.0 fluorescent cyanine dye label as the experimental sample and CY5.5-labeled
M13 T's as the reference sequence. Lanes 4, 8, 12 and 16 were loaded with a mixture of
T-terminated sequencing fragments from M13 labeled with CY5.0 fluorescent cyanine dye label as
the experimental sample and CY5.5-labeled M13 T's as the reference sample. The reference
sequence and the experimental sequence in this example are derived from the same source, and
indeed in the case of the T-terminated sequencing fragments are identical to the reference
sequence except for the difference in label. However, the good results for alignment and
linearization indicate that the reference sequence does not have to be related to the experimental
sequence in any way.

The labeled DNA molecules were separated by electrophoresis and detected in real
time using a 638 nm laser excitation source. The data collection was
performed on a 2-color DNA Sequencer (Visible Genetics Inc.), to record two channels for each
physical lane, one channel reflecting detection of the CY5.0 label affixed to the experimental
sequencing fragments and one channel reflecting detection of the CY5.5 label affixed to the
reference fragments. Collected data from the two channels were corrected for overlap in the
emission spectra of the two labels and the two resulting data traces were saved as a "data file."
Data analysis on the data file was performed using special software in accordance with the
protocols of the present invention.

For each reference channel, several peaks (from 3 to 40 in different experiments)
were identified having sizes in the range from 40 to about 1200 bases. A base number was
assigned to each of these peaks based on knowledge of the sequence of the reference sample, and
the position of each peak in the time scale of the experiment was determined. The information
about these peaks in the form of a base number and a peak position (or time) was stored in a
"peak file."

To align the raw data stored in the data file, the peak data was used to calculate
the standard number of bases per unit time as an average over the 16 reference channels. The
data was fit to 3rd and 5th order polynomials expressing the relationship between base number and peak
position. Using the fitted polynomial, a corrected time scale was created, so that the reference
peaks are equally spaced in the corrected time and have the same origin. The number of bases
per unit corrected time is constant for all the data in the run. However, the actual time interval
between peaks is not generally constant. Thus, the corrected time scale is used to resample the
experimental data trace and the associated reference channel. This procedure essentially involves
looking at the experimental data trace at the times specified by the corrected time scale, and determining
whether or not a peak is present at the correct time.

Figs. 1 and 2 illustrate the application of the invention to the specific sequences
described above. Fig. 1 shows the spacing between adjacent bases as a function of base
number, for non-aligned (raw) data (closed diamonds), and data aligned and linearized using
3rd order (open triangles) and 5th order (open circles) polynomials. It is clearly seen that the spacing
changes significantly during the run, but is linearized by fitting with either the 3rd or 5th order
polynomial.

Fig. 2 illustrates the influence of the order of the polynomial used for fitting the
raw data of the experimental traces. Increasing the polynomial from 3rd order (open triangles) to
4th order (closed diamonds) improves linearity noticeably, although the curve still has a
nonlinear part in the beginning of the run (up to about 100 bases). The 5th order polynomial
(open circles) gives the best result, with the maximum deviation from the straight line being less
than about 0.5 seconds up to 1300 bases. Such linearity is close to the limit in this particular
experiment, because the sampling time was 0.5 seconds. Thus, further increase in the order of
the polynomial would only increase computational time, without being likely to provide any
significant improvement in linearity.

Figs. 3A and B illustrate improvement in the alignment of the sequencing data
(from trace to trace) based on the procedure of the invention. Fig. 3A shows raw data. The
difference in run time can reach 500 seconds. Alignment of the raw data, even with a 3rd-degree
polynomial, improves the data significantly, reducing the difference in run time to a maximum of
about 90 seconds (see Fig. 3B). When a 5th-degree polynomial is used, the difference becomes less
than 10 seconds.
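
The quantity plotted in Figs. 3A and B, the difference for each lane between the run time of a given base and its average over all lanes, can be computed as in the following Python sketch. The run times used here are invented for the illustration and are not data from the experiment; the function names are hypothetical.

```python
def run_time_deviations(run_times):
    """Difference between each lane's run time for a base and the all-lane average."""
    average = sum(run_times) / len(run_times)
    return [t - average for t in run_times]


def run_time_spread(run_times):
    """Separation between the slowest and fastest lane for the same base."""
    return max(run_times) - min(run_times)


# Hypothetical run times (seconds) of one base in four lanes.
lanes = [100.0, 110.0, 90.0, 100.0]
print(run_time_deviations(lanes))  # [0.0, 10.0, -10.0, 0.0]
print(run_time_spread(lanes))      # 20.0
```

Applying the same functions to run times before and after alignment gives a direct measure of how much the alignment reduces lane-to-lane variation.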

Example 2 

Raw data traces were generated using M13 T-terminated sequencing fragments in
four adjacent lanes of a sequencing gel. As noted in Table 1, the raw, unaligned data traces
showed the substantial variability in peak position that can be observed. Application of a 5th
order polynomial to this data to determine a corrected time scale, and the application of this time
scale to the raw data traces, resulted in a substantial improvement in the alignment of the data.
This improved alignment allows the calling of bases with greater accuracy over the entire
1300-base length.

Table 1

Peak Number    Separation between high and    Separation between high and
               low time peaks, before         low time peaks, after
               alignment                      alignment

40             1 min 16 sec                   14 sec
140            2 min 5 sec                    4 sec
312            9 min 30 sec                   6 sec
607            41 min 38 sec                  4 sec
970            almost 1.5 hours               24 sec


Example 3 

To understand the significance of the number of peaks incorporated in the peak
file for use in generating the polynomial, the data from a T-terminated M13 fragment set was
processed using 3, 5, 10, 20 and 40 selected peaks, and the spacing between adjacent peaks at
various base positions after alignment was determined. The results are shown in Table 2. As can
be seen, higher numbers of peaks reduce the extent of variation in peak spacing, although even as
few as 3 peaks provide useful results. Comparison of the results from 10, 20 and 40 peaks
suggests that an increase beyond 40 would only add to the computational burden without
improving the quality of the result.

Example 4 

To evaluate the ability of the linearization and alignment processes of the
invention to provide a demonstrable improvement in base calling accuracy and read length, M13
sequence was used. CY5.0-labeled A, C, G and T-terminated sequence fragments were used as
experimental samples, while M13 T's labeled with CY5.5 were used as the reference sample.
Base-calling was performed on the raw data, and on the data after alignment based on 40 peaks
of the reference trace.

Table 2: Spacing between adjacent peaks

The relationship between accuracy and read length for each of these two
experiments is shown in Figs. 4 and 5, respectively. As shown in Fig. 4, for a given accuracy
(for example 97%), data alignment based on information from a reference channel allows an
increase in read length of at least 10%, i.e., another 100 bases can be accurately read.
Alternatively, for a given read length (for example 900 bases), it provides improved accuracy
(98.5% instead of 97%). These conclusions are based on results of base calling for lanes that were
relatively well-aligned to begin with. For channels which experience a large shift in the raw data,
the effect of alignment in accordance with the invention is more pronounced (Fig. 5). Thus, in
this experimental system, without alignment it is possible to call only 100 bases with reasonable
accuracy. After alignment, however, up to 1000 bases can be called.
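
One way to turn a set of base calls into a read length at a given accuracy is sketched below. This is an illustrative assumption, not the definition used for Figs. 4 and 5: accuracy is taken here as the cumulative fraction of correct calls from the start of the read, and the read length is the longest prefix that still meets the threshold. The function name and the toy sequences are hypothetical.

```python
def read_length_at_accuracy(called, truth, min_accuracy):
    """Longest prefix length whose cumulative base-calling accuracy is >= min_accuracy."""
    best = 0
    correct = 0
    for length, (c, t) in enumerate(zip(called, truth), start=1):
        if c == t:
            correct += 1
        if correct / length >= min_accuracy:
            best = length
    return best

# Toy data: a known 20-base sequence and two sets of calls for it.
truth = "ACGTACGTACGTACGTACGT"
raw_calls = "ACGTACGTACTTACCTACGA"      # miscalls at positions 11, 15 and 20
aligned_calls = "ACGTACGTACGTACGTACGA"  # a single miscall at position 20

for label, calls in (("raw", raw_calls), ("aligned", aligned_calls)):
    print(label, read_length_at_accuracy(calls, truth, 0.9))
```

Applied to calls made before and after alignment against the same known sequence, a measure of this kind allows the two comparisons made above: read length at fixed accuracy, and accuracy at fixed read length.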
What is Claimed is:



1. A method for assignment of base numbers to peaks within an experimental
DNA sequencing data trace derived from the separation of experimental DNA sequencing
fragments, comprising the steps of:
(a) obtaining one or more reference DNA sequencing data traces derived from
the separation of reference DNA sequencing fragments reflecting the position of at least one base
in a reference polynucleotide of known sequence;
(b) evaluating the reference DNA sequencing data traces to determine a
corrected time scale indicative of migration times at which peaks should occur;
(c) sampling the experimental DNA sequencing data trace at time points
determined by the corrected time scale, and
(d) assigning a base number to each peak found in the experimental DNA
sequencing data trace based upon the corrected time scale.

2. The method of claim 1, wherein the step of evaluating the reference DNA
sequence data traces includes the steps of:
(i) identifying a plurality of peaks in the reference DNA sequencing data
traces, and creating a data table containing the number of each peak based on the known
sequence of the polynucleotide, and the position of each peak in the reference DNA sequencing
data trace;
(ii) identifying a set of coefficients for a polynomial effective to substantially
linearize a plot of peak number versus separation between adjacent peaks; and
(iii) creating from the coefficients and the polynomial a corrected time scale
which reflects the positions at which a peak should occur at any given point in a sequencing data
trace.
3. The method of claim 1, wherein the experimental DNA sequencing data
trace and a first reference DNA sequencing data trace are derived from analysis of sequencing
fragments in a common lane of a sequencing gel.

4. The method of claim 1, wherein a plurality of reference DNA sequencing
data traces are obtained, each derived from the separation of the same set of reference DNA
sequencing fragments.

5. The method of claim 1, wherein the polynomial is a third or higher order
polynomial.

6. The method of claim 1, wherein a defined number of bands are selected
for evaluation from each of the reference DNA sequencing data traces.

7. The method of claim 6, wherein the defined number of bands selected is
from 3 to 40.

8. The method of claim 6, wherein the defined number of bands is at least
equal to the order of the polynomial, plus 1.

9. The method of claim 1, wherein base numbers are assigned to peaks
within a plurality of experimental DNA sequencing data traces derived from the separation of
experimental DNA sequencing fragments indicative of the positions of a plurality of types of
bases.
10. The method of claim 9, wherein base numbers are assigned to peaks
within four experimental DNA sequencing data traces derived from the separation of
experimental DNA sequencing fragments indicative of the positions of four types of bases.

11. A method for evaluating the sequence of a target polynucleotide,
comprising the steps of:
(a) obtaining one or more experimental DNA sequencing data traces derived
from the separation of experimental DNA sequencing fragments reflecting the position of at least
one base in the target polynucleotide and one or more reference DNA sequencing data traces
derived from the separation of reference DNA sequencing fragments reflecting the position of at
least one base in a reference polynucleotide of known sequence;
(b) evaluating the reference DNA sequencing data traces to determine a
corrected time scale indicative of migration times at which peaks should occur;
(c) sampling the experimental DNA sequencing data traces at time points
determined by the corrected time scale, and
(d) assigning a base number to each peak found in the experimental DNA
sequencing data traces based upon the corrected time scale, thereby obtaining information about
the sequence of the target polynucleotide.

12. The method of claim 11, wherein the step of evaluating the reference DNA
sequence data traces includes the steps of:
(i) identifying a plurality of peaks in the reference DNA sequencing data
traces, and creating a data table containing the number of each peak based on the known
sequence of the polynucleotide, and the position of each peak in the reference DNA sequencing
data trace;
(ii) identifying a set of coefficients for a polynomial effective to substantially
linearize a plot of peak number versus separation between adjacent peaks; and
(iii) creating from the coefficients and the polynomial a corrected time scale
which reflects the positions at which a peak should occur at any given point in a sequencing data
trace.

13. The method of claim 11, wherein the reference DNA sequencing traces
and the experimental DNA sequencing data trace are derived from analysis of sequencing
fragments in a common sequencing gel.

14. The method of claim 13, wherein the experimental DNA sequencing data
trace and a first reference DNA sequencing data trace are derived from analysis of sequencing
fragments in a common lane of the common sequencing gel.

15. The method of claim 11, wherein a plurality of reference DNA sequencing
data traces are obtained, each derived from the separation of the same set of reference DNA
sequencing fragments.

16. The method of claim 11, wherein the polynomial is a third or higher order
polynomial.

17. The method of claim 11, wherein a defined number of bands are selected
for evaluation from each of the reference DNA sequencing data traces.

18. The method of claim 17, wherein the defined number of bands selected is
from 3 to 40.

19. The method of claim 17, wherein the defined number of bands is at least
equal to the order of the polynomial, plus 1.
20. The method of claim 11, wherein base numbers are assigned to peaks
within a plurality of experimental DNA sequencing data traces derived from the separation of
experimental DNA sequencing fragments indicative of the positions of a plurality of types of
bases.

21. An apparatus for evaluating the sequence of a target polynucleotide,
comprising:
(a) an input for receiving information about one or more experimental DNA
sequencing data traces derived from the separation of experimental DNA sequencing fragments
reflecting the position of at least one base in the target polynucleotide and one or more reference
DNA sequencing data traces derived from the separation of reference DNA sequencing
fragments reflecting the position of at least one base in a reference polynucleotide of known
sequence;
(b) a processor, operatively programmed to evaluate the reference DNA
sequencing data traces to determine a corrected time scale indicative of migration times at which
peaks should occur;
(c) a processor, operatively programmed to sample the experimental DNA
sequencing data traces at time points determined by the corrected time scale;
(d) a processor, operatively programmed to assign a base number to each peak
found in the experimental DNA sequencing data traces based upon the corrected time scale,
thereby obtaining information about the sequence of the target polynucleotide; and
(e) an output for communicating the information about the sequence of the
target polynucleotide.

22. The apparatus of claim 21, wherein the processor programmed to evaluate
the reference DNA sequence data traces is programmed to perform the steps of:
(i) identifying a plurality of peaks in the reference DNA sequencing data
traces, and creating a data table containing the number of each peak based on the known
sequence of the polynucleotide, and the position of each peak in the reference DNA sequencing
data trace;
(ii) identifying a set of coefficients for a polynomial effective to substantially
linearize a plot of peak number versus separation between adjacent peaks; and
(iii) creating from the coefficients and the polynomial a corrected time scale
which reflects the positions at which a peak should occur at any given point in a sequencing data
trace.
ABSTRACT OF THE DISCLOSURE

In order to align DNA sequence data traces, an experimental data trace
representing the positions of a first species of base within a target polynucleotide and a reference
data trace representing the positions of a second species of base (which may be the same as or
different from the first species) within a reference polynucleotide are obtained by separating
appropriate sequencing fragments generated from the target and reference polynucleotides on an
electrophoresis gel. For each reference data trace, a plurality of peaks corresponding to
fragments having a size in the range of 40 to 1200 bases are selected. A base number is assigned
to each of the selected peaks in the reference data trace, and a numerical "peak file" is created
with information about the peak number and migration time (or distance). This peak file is
analyzed to determine a set of polynomial coefficients which will allow substantial linearization
of a plot of peak number versus separation between adjacent peaks and alignment of the traces
with respect to each other. These coefficients are used to create a corrected time scale identifying
where peaks should be located on a given experimental gel. This corrected time scale is used to
guide the sampling of the experimental data, and for assignment of peaks within the data.
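
The last two steps, sampling the experimental data on the corrected time scale and assigning base numbers to peaks, might be sketched as follows. This is only an illustration under stated assumptions: the corrected time scale is represented by a hypothetical function `expected_time` that maps a base number to the migration time at which its peak should occur, the trace is resampled by linear interpolation, and a peak is taken to be a simple local maximum. None of these names or numerical values come from the disclosure.

```python
import math
from bisect import bisect_right

def interpolate(times, values, t):
    """Linearly interpolate a trace (times ascending) at time t, clamping at the ends."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    hi = bisect_right(times, t)
    lo = hi - 1
    frac = (t - times[lo]) / (times[hi] - times[lo])
    return values[lo] + frac * (values[hi] - values[lo])

def resample(times, values, expected_time, first_base, last_base, per_base=4):
    """Sample the trace at per_base evenly spaced points per base on the corrected scale."""
    steps = (last_base - first_base) * per_base
    samples = []
    for s in range(steps + 1):
        base = first_base + s / per_base
        samples.append((base, interpolate(times, values, expected_time(base))))
    return samples

def call_peaks(samples):
    """Return the rounded base numbers of local maxima in the resampled trace."""
    peaks = []
    for (_, v0), (b1, v1), (_, v2) in zip(samples, samples[1:], samples[2:]):
        if v1 > v0 and v1 >= v2:
            peaks.append(round(b1))
    return peaks

def expected_time(base):
    """Hypothetical corrected time scale: base number -> expected migration time."""
    return 0.5 * base + 1e-4 * base * base

# Toy experimental trace: narrow Gaussian peaks at four bases, on a uniform raw time grid.
true_bases = [50, 53, 57, 60]
raw_times = [20.0 + 0.05 * i for i in range(301)]
raw_values = [
    sum(math.exp(-((t - expected_time(b)) ** 2) / (2 * 0.1 ** 2)) for b in true_bases)
    for t in raw_times
]

samples = resample(raw_times, raw_values, expected_time, first_base=48, last_base=62)
peaks = call_peaks(samples)
print(peaks)
```

Because the resampled points are evenly spaced in base-number units rather than in raw migration time, a peak's base number can be read off directly from its position on the corrected scale.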